Data analysis (and model building) involves many stages. In early exploration, it is useful to have a grip not only on the individual series (i.e. variables) available, but also on the relations between them. Unfortunately, understanding correlations between variables is difficult, because \(n\) variables give \(n(n-1) / 2\) pairs of variables. Furthermore, the mainstream way of visualizing them, the correlation matrix, has its limits: the more variables, the less readable (and therefore less meaningful) it becomes.
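To get a feel for how quickly the number of pairs grows, the count can be computed directly in base R. This is only a small illustration; the helper name `n_pairs` is ours and does not come from any package.

```r
# Number of distinct variable pairs among n variables: n * (n - 1) / 2.
# `n_pairs` is a hypothetical helper, not part of CorrGrapheR.
n_pairs <- function(n) n * (n - 1) / 2

n_vars <- c(7, 20, 89)
pairs <- n_pairs(n_vars)
print(data.frame(variables = n_vars, pairs = pairs))

# The closed form agrees with the binomial coefficient "n choose 2".
stopifnot(all(pairs == choose(n_vars, 2)))
```

With a few dozen variables, the number of pairs already runs into the thousands, which is why a full correlation matrix becomes hard to read.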
This package aims to plot correlations between variables in the form of a graph. Variables that are strongly correlated with each other (positively and negatively alike) are placed close together, and weakly correlated ones far apart.
This is achieved through a physical simulation, in which the nodes are treated as points with mass that push each other away, and the edges are treated as massless springs. The length of a spring depends on the absolute value of the correlation between the connected nodes: the stronger the correlation, the shorter the spring.
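The idea of shorter springs for stronger correlations can be sketched in base R. The linear mapping below, and the names `spring_length`, `min_len` and `max_len`, are assumptions made purely for illustration; the package may well use a different formula and different units.

```r
# Hypothetical mapping from correlation to spring length:
# |r| = 1 gives the shortest spring, |r| = 0 the longest.
# The sign of r is ignored, so positive and negative correlations
# of the same strength produce the same length.
spring_length <- function(r, min_len = 50, max_len = 300) {
  stopifnot(all(abs(r) <= 1))
  max_len - abs(r) * (max_len - min_len)
}

r <- c(-0.9, -0.2, 0, 0.5, 1)
len <- spring_length(r)
print(data.frame(correlation = r, length = len))
```

Any monotonically decreasing function of the absolute correlation would express the same idea; the linear form is chosen here only because it is the simplest to read.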
Let’s take a look at one of the datasets available in R, Seatbelts. It contains information about road casualties in Great Britain in the period 1969-84 and has the following columns:
* DriversKilled - number of car drivers killed
* drivers - number of car drivers killed or seriously injured
* front - number of front-seat passengers killed or seriously injured
* rear - number of rear-seat passengers killed or seriously injured
* kms - distance driven
* PetrolPrice - petrol price
* VanKilled - number of van drivers killed
* law - binary variable; was the law enforcing seatbelt use in effect?

For our purposes, since Pearson’s correlation coefficient is not very informative for binary variables, we drop the law variable.
library('CorrGrapheR')
library('magrittr')
df <- as.data.frame(datasets::Seatbelts)[,-8] # Drop the binary variable
corrgrapher(df) %>% plot()
As expected, we see that all variables regarding casualties are correlated with each other, although rear and VanKilled more weakly than the others. We also observe a negative correlation between PetrolPrice and the variables drivers, DriversKilled and front.
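The relationships read off the graph can be cross-checked against the plain correlation matrix. The sketch below uses only base R (`cor()`, which computes Pearson correlations by default) and does not depend on CorrGrapheR; the object name `cor_mat` is ours.

```r
# Same data as above: Seatbelts without the binary `law` column.
df <- as.data.frame(datasets::Seatbelts)[, -8]

# Pearson correlation matrix, rounded for readability.
cor_mat <- cor(df)
print(round(cor_mat, 2))

# Correlations of PetrolPrice with selected casualty variables.
print(round(cor_mat["PetrolPrice", c("drivers", "DriversKilled", "front")], 2))
```

With only seven variables the matrix is still manageable, which makes this dataset a convenient place to compare both views.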
Let’s look at something more challenging to visualize. The dataset for the FIFA 19 soccer game contains 89 columns of data about soccer players from all around the world. Visualizing it is a non-trivial task.
Here, let us introduce a way of combining the CorrGrapheR package with packages from the DrWhy.AI family. The corrgrapher function may take an explainer object (created with the help of the DALEX package), extract the data from it, and add extra features to the displayed figure.
For this use case, let us create a model based on numerical variables that will predict the value in EUR of soccer players.
library("gbm")
data('fifa20', package = 'CorrGrapheR')
fifa20_selected <- fifa20[,c(4,5,7,8,11:13,17,25:26,45:78)]
# Value is skewed. Will be much easier to model log10(Value).
fifa20_selected$value_eur <- log10(fifa20_selected$value_eur)
fifa20_selected$team_position <- factor(fifa20_selected$team_position)
fifa20_selected <- na.omit(fifa20_selected)
fifa20_selected <- fifa20_selected[fifa20_selected$value_eur > 0,]
fifa20_selected <- fifa20_selected[!duplicated(fifa20_selected[,1]),]
rownames(fifa20_selected) <- fifa20_selected[,1]
fifa20_selected <- fifa20_selected[,-1]
# Create a gbm model
set.seed(1313)
# 4:5 are overall and potential, too strong predictors
fifa_gbm <- gbm(value_eur ~ . , data = fifa20_selected[,-(4:5)], n.trees = 250, interaction.depth = 3)
# Create DALEX explainer; predictions are transformed back from log10 to EUR
fifa_gbm_exp <- DALEX::explain(fifa_gbm,
                               data = fifa20_selected[, -6],
                               y = 10^fifa20_selected$value_eur,
                               predict_function = function(m,x) 10^predict(m, x, n.trees = 250))
fifa_feat <- ingredients::feature_importance(fifa_gbm_exp)
fifa_pd <- ingredients::partial_dependency(fifa_gbm_exp)
# Finally, create a corrgrapher object
fifa_cgr <- corrgrapher(fifa_gbm_exp, cutoff = 0.4, feature_importance = fifa_feat, partial_dependency = fifa_pd)
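The model above is fitted on log10(value_eur), while the explainer reports predictions in EUR by applying `10^` to the model output. A minimal sketch of that round trip follows, on simulated data and with `lm()` swapped in for gbm to keep it dependency-free. The data and all names here (`skill`, `value`, `toy`, `predict_original`) are made up for illustration and have nothing to do with the FIFA dataset.

```r
set.seed(1)
# Simulated, right-skewed target; not the FIFA data.
skill <- runif(200, 40, 95)
value <- 10^(2 + 0.05 * skill + rnorm(200, sd = 0.1))
toy <- data.frame(skill = skill, log_value = log10(value))

# Fit on the log10 scale, where the relationship is roughly linear.
fit <- lm(log_value ~ skill, data = toy)

# Predict function that returns values on the original scale,
# mirroring the 10^predict(...) wrapper passed to the explainer.
predict_original <- function(model, newdata) 10^predict(model, newdata)

pred <- predict_original(fit, data.frame(skill = c(50, 70, 90)))
print(round(pred))
```

Keeping the back-transformation inside the predict function means that anything computed from the explainer is expressed in the original units rather than in log10 units.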
The visualisation is enriched with the feature importance and partial dependence results passed to corrgrapher() above.
Thanks to the implementation of the knit_print() method, an object of class corrgrapher can be displayed simply by calling it:
fifa_cgr

The figure is interactive: feel free to select a variable from the drop-down selector or to click on a node.
What we can extract from the figure:
* The key variables are movement_reactions, age, skill_ball_control, attacking_finishing and skill_dribbling
* The features containing goalkeepers’ skills are very highly correlated with each other and negatively correlated with the rest
* The features containing defenders’ skills are correlated with each other and with mentality_interceptions
* movement_sprint_speed is correlated with movement_acceleration
* …
In this example, we take a look at a smaller, artificial dataset containing some information about a population of dragons. It is a useful example, because here we can observe a situation where correlations are rare.
Once again, let us set up a model that will predict the color of a dragon based on the remaining, numerical variables.
library(ranger)
data(dragons, package='DALEX')
model <- ranger::ranger(colour ~ ., data = dragons, num.trees = 100, probability = TRUE)
model_exp <- DALEX::explain(model, data = dragons[,-5], y = dragons$colour)
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 2000 rows 7 cols
## -> target variable : 2000 values
## -> target variable : Please note that 'y' is a factor. ( WARNING )
## -> target variable : Consider changing the 'y' to a logical or numerical vector.
## -> target variable : Otherwise I will not be able to calculate residuals or loss function.
## -> model_info : package ranger , ver. 0.12.1 , task classification ( default )
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : predict function returns multiple columns: 4 ( WARNING ) some of functionalities may not work
## -> residual function : difference between y and yhat ( default )
## Warning in Ops.factor(y, predict_function(model, data)): '-' not meaningful for
## factors
## -> residuals : numerical, min = NA , mean = NA , max = NA
## A new explainer has been created!
model_fi <- ingredients::feature_importance(model_exp, loss_function = DALEX::loss_accuracy, type = 'raw')
model_pd <- ingredients::partial_dependence(model_exp, N=100, grid_points = 81)
dragons_cgr <- corrgrapher(model_exp, feature_importance = model_fi, partial_dependency = model_pd)
dragons_cgr

Here we see that the variables are mostly uncorrelated. The few correlations that exist can be identified instantly.
If you wish to save a single corrgrapher object, use the save_cgr_to_html() function:
## NOT RUN
save_cgr_to_html(fifa_cgr)
It will produce an HTML file with output similar to that of the chunks above.